Jan 30, 2014 7:27pm
Re: CDX digest not accurately capturing duplicates?
The digest isn’t the SHA-1 hash, at least not in its usual hex form. For the Google logo GIF linked above, the SHA-1 hash is fd852df5478eb7eb9410ee9101bb364adf487fb0. None of the digests recorded on the CDX page matches it.
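One thing that may be worth ruling out before giving up on SHA-1 is the encoding. The CDX digests are 32 characters drawn from A-Z and 2-7, which is what Base32 looks like, and a 20-byte SHA-1 value encodes to exactly 32 Base32 characters. Here is a minimal sketch for re-encoding a hex SHA-1 so it can be compared against the CDX column; the function name `hex_sha1_to_base32` is my own, and whether the result matches any CDX record for this file is something to check, not something I am asserting.

```python
import base64

def hex_sha1_to_base32(hex_digest):
    """Re-encode a 40-character hex SHA-1 as 32 Base32 characters."""
    raw = bytes.fromhex(hex_digest)  # 20 raw bytes
    return base64.b32encode(raw).decode("ascii")

# The hex hash quoted above; compare the printed value with the CDX digest column.
print(hex_sha1_to_base32("fd852df5478eb7eb9410ee9101bb364adf487fb0"))
```

If the re-encoded value still matches nothing, the digest is presumably computed over something other than the bytes that were hashed locally.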
There must be ways to get the original unmodified page. (I don’t know any, though.)
On the other hand, (i) the Wayback Machine almost certainly modifies the links on the fly, and (ii) the saved webpages are expected to differ in all imaginable and unimaginable ways. Therefore the code doing the modification must be quite simple, as it is supposed to work on every saved webpage without breaking it. I am willing to bet that it does nothing more than insert a few lines and change all the URLs to add the prefix web.archive.org/web/TIMESTAMP. As such, after comparing a few pages, once you have identified exactly what is being changed, you could be certain, for all practical purposes, that by doing the opposite (i.e., deleting the inserted lines and the URL prefix) you are getting the original pages back. (In case you go this route, please post what you learn on the forum too, as it would be useful to others.)
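A rough sketch of what "doing the opposite" could look like, assuming the two changes are as guessed above. Both patterns are assumptions to be confirmed by comparing real pages: the comment markers around the inserted block, and the shape of the rewritten links (a 14-digit timestamp, optionally followed by a two-letter modifier such as `im_`). The name `strip_wayback_rewrites` is made up for illustration.

```python
import re

# Assumed shape of a rewritten link: optional scheme and host, then
# /web/<14-digit timestamp><optional two-letter modifier>_/ before the original URL.
WAYBACK_PREFIX = re.compile(
    r"(?:https?:)?(?://web\.archive\.org)?/web/\d{14}(?:[a-z]{2}_)?/(?=https?://)"
)

# Assumed comment markers surrounding the lines inserted into the page.
TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->\s*",
    re.DOTALL,
)

def strip_wayback_rewrites(html):
    """Remove the inserted block, then drop the archive prefix from every link."""
    html = TOOLBAR.sub("", html)
    return WAYBACK_PREFIX.sub("", html)
```

Note that this would turn every rewritten link into an absolute URL, so a page that originally used relative links would not come back byte-for-byte identical, and its hash would still differ from the original.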
Also, note that you may not have to check the digests for all the saved pages. If you are willing to assume that the same digest implies the same page (even though we already know that the same page does not imply the same digest), you could proceed with the CDX collapse-based-on-digest result (i.e., exactly where you started). That is to say, the CDX collapse result would already have removed adjacent records with the same digest; you could use this list as your master list.
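To make "adjacent" concrete, here is a small local sketch of that kind of collapsing. It only mimics the behaviour described above on a list of (timestamp, digest) pairs; the sample values are invented and `collapse_adjacent` is a hypothetical helper, not part of any CDX tooling.

```python
from itertools import groupby

def collapse_adjacent(records):
    """Keep the first record of each run of consecutive equal digests."""
    return [next(run) for _, run in groupby(records, key=lambda rec: rec[1])]

# Invented sample data: (timestamp, digest) pairs sorted by timestamp.
captures = [
    ("20130101000000", "AAAA"),
    ("20130201000000", "AAAA"),
    ("20130301000000", "BBBB"),
    ("20130401000000", "AAAA"),
]
master_list = collapse_adjacent(captures)
```

Only neighbouring duplicates are dropped, so the last "AAAA" capture survives because a different digest sits between it and the earlier run. A digest can therefore still appear more than once in the master list.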
This post was modified by aibek on 2014-01-31 03:27:04